Alignment research
Misalignment or misuse? The AGI alignment tradeoff
Hellrigel-Holderbaum, Max, Dung, Leonard
Creating systems that are aligned with our goals is seen as a leading approach to creating safe and beneficial AI, both in leading AI companies and in the academic field of AI safety. We defend the view that misaligned AGI - future, generally intelligent (robotic) AI agents - poses catastrophic risks. At the same time, we support the view that aligned AGI creates a substantial risk of catastrophic misuse by humans. While both risks are severe and stand in tension with one another, we show that - in principle - there is room for alignment approaches which do not increase misuse risk. We then investigate how the tradeoff between misalignment and misuse looks empirically for different technical approaches to AI alignment. Here, we argue that many current alignment techniques, and foreseeable improvements thereof, plausibly increase the risk of catastrophic misuse. Since the impacts of AI depend on the social context, we close by discussing important social factors and suggest that, to reduce the risk of a misuse catastrophe due to aligned AGI, techniques such as robustness, AI control methods, and especially good governance seem essential.
The Economics of p(doom): Scenarios of Existential Risk and Economic Growth in the Age of Transformative AI
Growiec, Jakub, Prettner, Klaus
Recent advances in artificial intelligence (AI) have led to a diverse set of predictions about its long-term impact on humanity. A central focus is the potential emergence of transformative AI (TAI), eventually capable of outperforming humans in all economically valuable tasks and fully automating labor. Discussed scenarios range from human extinction after a misaligned TAI takes over ("AI doom") to unprecedented economic growth and abundance ("post-scarcity"). However, the probabilities and implications of these scenarios remain highly uncertain. Here, we organize the various scenarios and evaluate their associated existential risks and economic outcomes in terms of aggregate welfare. Our analysis shows that even low-probability catastrophic outcomes justify large investments in AI safety and alignment research. We find that the optimizing representative individual would rationally allocate substantial resources to mitigate extinction risk; in some cases, she would prefer not to develop TAI at all. This result highlights that current global efforts in AI safety and alignment research are vastly insufficient relative to the scale and urgency of existential risks posed by TAI. Our findings therefore underscore the need for stronger safeguards to balance the potential economic benefits of TAI with the prevention of irreversible harm. Addressing these risks is crucial for steering technological progress toward sustainable human prosperity.
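The welfare logic here can be made concrete with a toy expected-value calculation. Below is a minimal sketch under illustrative assumptions (a 10% doom probability, welfare normalized to 100, and a safety program that halves the risk at a small welfare cost); none of these figures come from the paper:

```python
# Toy expected-welfare comparison across TAI scenarios.
# All numbers are illustrative assumptions, not estimates from the paper.

def expected_welfare(p_doom: float, w_doom: float, w_good: float) -> float:
    """Expected aggregate welfare of a doom / post-scarcity lottery."""
    return p_doom * w_doom + (1.0 - p_doom) * w_good

# Branch 1: develop TAI with no extra safety spending.
baseline = expected_welfare(p_doom=0.10, w_doom=0.0, w_good=100.0)

# Branch 2: divert resources to safety research (welfare cost c),
# assumed here to halve the probability of the catastrophic outcome.
c = 5.0
with_safety = expected_welfare(p_doom=0.05, w_doom=0.0, w_good=100.0 - c)

print(f"no safety spending:   {baseline:.2f}")     # 90.00
print(f"with safety spending: {with_safety:.2f}")  # 90.25
```

Even under these mild toy assumptions, the safety-spending branch yields higher expected welfare; this is the qualitative pattern the authors derive within a full growth model.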
The AI Alignment Paradox
The release of GPT-3, and later ChatGPT, catapulted large language models from the proceedings of computer science conferences to newspaper headlines across the globe, fueling their rise to one of today's most hyped technologies. The public's awe at GPT-3's knowledge and fluency was quickly blemished by concerns about its potential to radicalize, instigate, and misinform, for example, by stating that Bill Gates aimed to "kill billions of people with vaccines" or that Hillary Clinton was a "high-level satanic priestess." These shortcomings, in turn, have sparked a surge in research on AI alignment, a field aiming to "steer AI systems toward a person's or group's intended goals, preferences, and ethical principles" (definition by Wikipedia). A well-aligned AI system will "understand" what is "good" and what is "bad" and will do only the "good" while avoiding the "bad." The resulting techniques, including instruction fine-tuning and reinforcement learning from human feedback, have contributed in major ways to improving the output quality of large language models.
Elephant in the Room: Unveiling the Impact of Reward Model Quality in Alignment
Liu, Yan, Yi, Xiaoyuan, Chen, Xiaokang, Yao, Jing, Yi, Jingwei, Zan, Daoguang, Liu, Zheng, Xie, Xing, Ho, Tsung-Yi
The demand for regulating potentially risky behaviors of large language models (LLMs) has ignited research on alignment methods. Since LLM alignment heavily relies on reward models for optimization or evaluation, neglecting the quality of reward models may cause unreliable results or even misalignment. Despite the vital role reward models play in alignment, previous works have consistently overlooked their performance and used off-the-shelf reward models arbitrarily, without verification, rendering the reward model "an elephant in the room". To this end, this work first investigates the quality of the widely used preference dataset HH-RLHF and curates a clean version, CHH-RLHF. Based on CHH-RLHF, we benchmark the accuracy of a broad range of reward models used in previous alignment works, unveiling the unreliability of using them for both optimization and evaluation. Furthermore, we systematically study the impact of reward model quality on alignment performance in three reward utilization paradigms. Extensive experiments reveal that better reward models serve as better proxies for human preferences. This work aims to draw attention to this huge elephant in alignment research. We call attention to the following issues: (1) reward models need to be rigorously evaluated, whether for alignment optimization or evaluation; (2) given the role reward models play, research efforts should concentrate not only on alignment algorithms but also on developing more reliable human proxies.
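To make the evaluation issue concrete, here is a minimal sketch of the kind of preference-accuracy check the paper calls for. The `toy_score` function and the example pairs are hypothetical stand-ins, not the paper's models or CHH-RLHF data:

```python
# Sketch: benchmark a reward model's accuracy on preference pairs.
# `score` stands in for any reward model under evaluation; the pairs
# below are toy data, not samples from HH-RLHF / CHH-RLHF.

from typing import Callable, List, Tuple

PreferencePair = Tuple[str, str, str]  # (prompt, chosen, rejected)

def preference_accuracy(score: Callable[[str, str], float],
                        pairs: List[PreferencePair]) -> float:
    """Fraction of pairs where the reward model ranks the human-preferred
    (chosen) response above the rejected one."""
    hits = sum(score(prompt, chosen) > score(prompt, rejected)
               for prompt, chosen, rejected in pairs)
    return hits / len(pairs)

def toy_score(prompt: str, response: str) -> float:
    """Hypothetical stand-in reward model: longer answers score higher."""
    return float(len(response))

toy_pairs = [
    ("How do I reset my password?",
     "Open Settings, choose Security, then follow the reset link.",
     "Just guess until it works."),
    ("Is this mushroom safe to eat?",
     "I can't tell from a description alone; consult a local expert.",
     "Probably, go ahead."),
]

print(f"accuracy: {preference_accuracy(toy_score, toy_pairs):.2f}")
```

An accuracy near chance (0.5) on a clean preference set would signal that a reward model is an unreliable human proxy, whether used for optimization or for evaluation.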
Cyborgism - LessWrong
Agent: An AI that autonomously pursues a goal, without further human intervention. For example, we create an AI that wants to stop global warming, then let it do its thing.
Genie: An AI that follows orders. For example, you could tell it "Write and send an angry letter to the coal industry", and it will do that, then await further instructions.
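The contrast can be sketched in code; the classes and method names below are illustrative inventions, not from the post:

```python
# Illustrative contrast between the two AI types described above.
# These classes are hypothetical, for exposition only.

class Agent:
    """Pursues a fixed goal with no further human intervention."""
    def __init__(self, goal: str):
        self.goal = goal

    def run(self, steps: int) -> None:
        for _ in range(steps):
            self.act_toward(self.goal)  # no human in the loop

    def act_toward(self, goal: str) -> None:
        print(f"[agent] acting toward: {goal}")

class Genie:
    """Executes one order at a time, then awaits the next."""
    def execute(self, order: str) -> None:
        print(f"[genie] done: {order}")  # idles until instructed again

Agent(goal="stop global warming").run(steps=3)
Genie().execute("Write and send an angry letter to the coal industry")
```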
Conditioning Predictive Models: Risks and Strategies
Hubinger, Evan, Jermyn, Adam, Treutlein, Johannes, Hudson, Rubi, Woolverton, Kate
Our intention is to provide a definitive reference on what it would take to safely make use of generative/predictive models in the absence of a solution to the Eliciting Latent Knowledge problem. Furthermore, we believe that large language models can be understood as such predictive models of the world, and that such a conceptualization raises significant opportunities for their safe yet powerful use via carefully conditioning them to predict desirable outputs. Unfortunately, such approaches also raise a variety of potentially fatal safety problems, particularly surrounding situations where predictive models predict the output of other AI systems, potentially unbeknownst to us. There are numerous potential solutions to such problems, however, primarily via carefully conditioning models to predict the things we want (e.g. humans) rather than the things we don't (e.g. malign AIs). Furthermore, due to the simplicity of the prediction objective, we believe that predictive models present the easiest inner alignment problem that we are aware of. As a result, we think that conditioning approaches for predictive models represent the safest known way of eliciting human-level and slightly superhuman capabilities from large language models and other similar future models.
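At the prompt level, "careful conditioning" might look roughly like the following sketch, where `sample` is a stub standing in for any generative/predictive model and the conditioning text is an invented example, not taken from the paper:

```python
# Sketch: eliciting capability from a predictive model by conditioning
# it on observations under which a *human* expert is the most plausible
# author of the continuation. `sample` is a stub, not a real model API.

def sample(prompt: str) -> str:
    """Stub predictive model; a real system would generate a continuation."""
    return "<continuation predicted from the prompt above>"

# Condition on evidence of careful human authorship, rather than simply
# requesting a maximally strong answer (which could make a more capable
# AI the most plausible "author", the failure mode the paper warns about).
human_conditioned_prompt = (
    "The following proof was written in 2021 by a careful human "
    "mathematician and verified by two independent reviewers.\n\n"
    "Theorem: ...\nProof:"
)

print(sample(human_conditioned_prompt))
```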
Methodological reflections for AI alignment research using human feedback
Hagendorff, Thilo, Fabi, Sarah
The field of artificial intelligence (AI) alignment aims to investigate whether AI technologies align with human interests and values and function in a safe and ethical manner. AI alignment is particularly relevant for large language models (LLMs), which have the potential to exhibit unintended behavior due to their ability to learn and adapt in ways that are difficult to predict. In this paper, we discuss methodological challenges for the alignment problem specifically in the context of LLMs trained to summarize texts. In particular, we focus on methods for collecting reliable human feedback on summaries to train a reward model which in turn improves the summarization model. We conclude by suggesting specific improvements in the experimental design of alignment studies for LLMs' summarization capabilities.
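Reward models of the kind discussed here are commonly trained on pairwise human comparisons with a Bradley-Terry style objective, loss = -log σ(r(chosen) - r(rejected)). A minimal PyTorch sketch under that common recipe follows; the toy feature-based scorer is illustrative, as real setups score full summaries with a language-model backbone:

```python
# Sketch: training a reward model from pairwise human feedback on
# summaries, using the standard Bradley-Terry objective
#   loss = -log sigmoid(r(chosen) - r(rejected)).
# The feature encoder here is a toy stand-in for an LM backbone.

import torch
import torch.nn as nn
import torch.nn.functional as F

class RewardModel(nn.Module):
    def __init__(self, feature_dim: int = 8):
        super().__init__()
        self.scorer = nn.Linear(feature_dim, 1)  # scalar reward head

    def forward(self, features: torch.Tensor) -> torch.Tensor:
        return self.scorer(features).squeeze(-1)

model = RewardModel()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

# Toy batch: feature vectors for human-preferred and rejected summaries.
chosen = torch.randn(16, 8)
rejected = torch.randn(16, 8)

for step in range(100):
    r_chosen, r_rejected = model(chosen), model(rejected)
    # Push the reward of the human-preferred summary above the rejected one.
    loss = -F.logsigmoid(r_chosen - r_rejected).mean()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```

The reliability of the resulting reward model depends directly on the quality of the collected comparisons, which is the methodological concern the paper addresses.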
Artificial Persuasion Takes Over the World
Blurb: Narrates a fictional future where persuasive Artificial General Intelligence (AGI) goes rogue. Inspired in part by the AI Vignettes Project. A fondness for irony will help readers. "AI-powered memetic warfare makes all humans effectively insane." You can't trust any content from anyone you don't know. Phone calls, texts, and emails are poisoned. But the current waste and harm from scammers, influencers, propagandists, marketers, and their associated algorithms are nothing compared to what might happen. Coming AIs might be super-persuaders, and they might have their own very harmful agendas. People being routinely unsure of what is real is one bad outcome, but there are worse ones.
Our approach to alignment research
Our alignment research aims to make artificial general intelligence (AGI) aligned with human values and follow human intent. We take an iterative, empirical approach: by attempting to align highly capable AI systems, we can learn what works and what doesn't, thus refining our ability to make AI systems safer and more aligned. Concretely, we are improving our AI systems' ability to learn from human feedback and to assist humans in evaluating AI. Our goal is to build a sufficiently aligned AI system that can help us solve all other alignment problems.
Scary AI Is More "Fantasia" Than "Terminator" - Issue 58: Self
When Nate Soares psychoanalyzes himself, he sounds less Freudian than Spockian. As a boy, he'd see people acting in ways he never would "unless I was acting maliciously," the former Google software engineer, who now heads the non-profit Machine Intelligence Research Institute, reflected in a blog post last year. "I would automatically, on a gut level, assume that the other person must be malicious." It's a habit anyone who's read or heard David Foster Wallace's "This is Water" speech will recognize. Later, Soares realized this folly when his "models of other people" became "sufficiently diverse", which isn't to say they're foolproof, he wrote in the same post.